On loade la matrice de reads par échantillon et par gènes obtenue par htseq-count. On supprime les dernières lignes qui correspondent aux reads non identifiés comme des gènes ou comme ambigus.
data = read.table("quantif_all.txt", h = T)
rownames(data) = data$Gene
data = dplyr::select(data, -c("Gene"))
genes = which(!(grepl("__", rownames(data))))
data = data[genes, ]
data = data[, sort(colnames(data))]
kable(head(data))| WT.K1 | WT.K2 | WT.K3 | WT.N1 | WT.N2 | WT.N3 | |
|---|---|---|---|---|---|---|
| AT1G01010.1 | 1 | 5 | 2 | 3 | 6 | 3 |
| AT1G01020.1 | 0 | 0 | 0 | 0 | 1 | 0 |
| AT1G01020.2 | 0 | 0 | 0 | 0 | 0 | 0 |
| AT1G01030.1 | 0 | 4 | 2 | 0 | 0 | 0 |
| AT1G01040.1 | 12 | 15 | 8 | 18 | 14 | 14 |
| AT1G01040.2 | 0 | 0 | 0 | 0 | 0 | 0 |
On rensigne le design et le type d’échantillons.
design = data.frame(row.names = colnames(data), condition = c("untreated", "untreated",
"untreated", "treated", "treated", "treated"), libType = rep("single-end",
6))
kable(design)| condition | libType | |
|---|---|---|
| WT.K1 | untreated | single-end |
| WT.K2 | untreated | single-end |
| WT.K3 | untreated | single-end |
| WT.N1 | treated | single-end |
| WT.N2 | treated | single-end |
| WT.N3 | treated | single-end |
D’après les 2 papiers de review lus (et détails techniques sur https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) :
RPKM non adaptés pour la DE : Permettent la comparaison entre gènes d’un même échantillon en corrigeant la profondeur de séquençage et la taille des gènes, mais ne permet pas de comparer des échantillons différents. (La somme des counts normalisés en RPKM ne sera pas la même pour deux échantillons). De plus, on compare des gènes identiques uniquement, donc la taille n’est pas importante.
DESeq : la médiane de ratios : On calcule un sizeFactor propre à chaque échantillon pour corriger la profondeur de séquençage. On divise chaque count par la moyenne géométrique des counts de son gène parmis tous les échantillons (Comme si on créait un nouvel échantillon de pseudo-référence. Les ratios dans un même échantillon devraient être similaires entre la plupart des gènes, qui sont en majorité non differentiellement exprimés). Le size factor d’un échantillon est la médiane de ces ratios pour tous les gènes. Pour normaliser, on divise tous les counts d’un échantillon par son sizeFactor. (On ne considère pas dans la médiane les gènes ayant une moyenne géométrique de 0)
TMM :
à voir avec EdgeR.
cds = estimateSizeFactors(cds)
keep <- rowSums(counts(cds)) >= 10
cds <- cds[keep, ]
sizeFactors(cds) WT.K1 WT.K2 WT.K3 WT.N1 WT.N2 WT.N3
0.5807179 1.4086346 0.4817462 1.3604623 1.2840339 1.5661509
par(mfrow = c(1, 2))
heatmap(counts(cds)[sample(rownames(counts(cds, normalized = FALSE)), 500),
], main = "Sans normalisation")heatmap(counts(cds, normalized = TRUE)[sample(rownames(counts(cds, normalized = TRUE)),
500), ], main = "Avec normalisation")The variance seen between counts is the sum of two components: the sample-to-sample variation just mentioned, and the uncertaintyin measuring a concentration by counting reads. The latter, known as shot noise or Poisson noise, is the dominatingnoise source for lowly expressed genes. The former dominates for highly expressed genes. The sum of both, shot noiseand dispersion, is considered in the differential expression inference.Hence, the variancevof count values is modelled as \(v = s\mu + \alpha s^2\mu^2\),
where \(\mu\) is the expected normalized count value (estimated by the average normalized count value),\(s\) is the size factor for the sample under consideration, and \(\alpha\) is the dispersion value for the gene under consideration.
# Analyse d’expression différentielle
On continue maintenant avec DESeq2, qui utilise le même test statistique que DESeq mais avec des améliorations pour les estimations et le GLM. Test : maintenant que les variances ont été estimées, il devient facile de comparer les conditions avec un test, rbinom.
A decrire mieux et parler des modèles, tout ça tout ça.
On construit le dataset, puis on enlève les lignes dont la somme des reads ne dépasse pas 10.
dds <- DESeq2::DESeqDataSetFromMatrix(countData = data, colData = design, design = ~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
dds$condition <- relevel(dds$condition, ref = "untreated")
dds <- DESeq2::DESeq(dds)
res <- results(dds)
resLFC <- lfcShrink(dds, coef = "condition_treated_vs_untreated", type = "apeglm")
resOrdered <- res[order(res$padj), ]
sum(res$padj < 0.05, na.rm = TRUE)[1] 48
On a peu de gènes DE, seulement 48 avec un seuil à 5% de p-value, et 60 à 10%.
Visu du gènes le plus différentiellement exprimé :
d <- plotCounts(dds, gene = which.min(res$padj), intgroup = "condition", returnData = TRUE)
ggplot(d, aes(x = condition, y = count)) + geom_point(position = position_jitter(w = 0.1,
h = 0)) + scale_y_log10(breaks = c(25, 100, 400))On essaie de voir ce qui pourrait être étrange dans les données pour avoir si peu de gènes :
[1] 1169
En prenant une condition sur le log2FoldChange on a beaucoup plus de gènes DE. Affichage en heatmap des gènes les plus différentiellement exprimés.
par(mfrow = c(1, 2))
thr = 0.05
res = na.omit(res)
DEgenes = rownames(res[abs(res$log2FoldChange) > 1, ])
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "abs(Log2FoldChange) > 1 ")DEgenes = rownames(res[res$padj < thr, ])
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "Significatifs")DEgenes_pval = rownames(res[res$padj < thr, ])
DEgenes_upreg = rownames(res[res$log2FoldChange > 1, ])
DEgenes = intersect(DEgenes_pval, DEgenes_upreg)
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "Juste les significatifs uprégulés")On s’intéresse maintenant à la liste de gènes sélectionnés et à leur ontologie.
mart = useMart(biomart = "plants_mart", host = "plants.ensembl.org", dataset = "athaliana_eg_gene")
results <- getBM(filters = "ensembl_transcript_id", attributes = c("ensembl_transcript_id",
"description", "external_gene_name"), values = DEgenes, mart = mart)
rownames(results) = results$ensembl_transcript_id
r = results[DEgenes, ]
kable(r, caption = "Genes DE et leur GO")| ensembl_transcript_id | description | external_gene_name | |
|---|---|---|---|
| AT1G05020.1 | AT1G05020.1 | Clathrin coat assembly protein AP180 [Source:UniProtKB/Swiss-Prot;Acc:Q9ZVN6] | AP180 |
| AT1G07520.1 | AT1G07520.1 | GRAS family transcription factor [Source:TAIR;Acc:AT1G07520] | |
| AT1G08090.1 | AT1G08090.1 | High-affinity nitrate transporter 2.1 [Source:UniProtKB/Swiss-Prot;Acc:O82811] | NRT2.1 |
| AT1G14890.1 | AT1G14890.1 | Plant invertase/pectin methylesterase inhibitor superfamily protein [Source:UniProtKB/TrEMBL;Acc:F4HXW0] | |
| AT1G15550.1 | AT1G15550.1 | Gibberellin 3-beta-dioxygenase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q39103] | GA3OX1 |
| AT1G24290.1 | AT1G24290.1 | AAA-type ATPase family protein [Source:UniProtKB/TrEMBL;Acc:O48696] | |
| AT1G35560.1 | AT1G35560.1 | Transcription factor TCP23 [Source:UniProtKB/Swiss-Prot;Acc:Q9LQF0] | TCP23 |
| AT1G37130.1 | AT1G37130.1 | Nitrate reductase [NADH] 2 [Source:UniProtKB/Swiss-Prot;Acc:P11035] | NIA2 |
| AT1G43710.1 | AT1G43710.1 | SDC1 [Source:UniProtKB/TrEMBL;Acc:A0A178WA97] | SDC |
| AT1G50750.1 | AT1G50750.1 | Plant mobile domain protein family [Source:TAIR;Acc:AT1G50750] | |
| AT1G64185.1 | AT1G64185.1 | At1g64185 [Source:UniProtKB/TrEMBL;Acc:Q8LGF9] | |
| AT1G64190.1 | AT1G64190.1 | 6-phosphogluconate dehydrogenase, decarboxylating 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SH69] | PGD1 |
| AT1G68670.1 | AT1G68670.1 | Transcription factor HHO2 [Source:UniProtKB/Swiss-Prot;Acc:Q8VZS3] | HHO2 |
| AT1G77760.1 | AT1G77760.1 | Nitrate reductase [Source:UniProtKB/TrEMBL;Acc:A0A178WBR8] | NIA1 |
| AT1G78060.1 | AT1G78060.1 | Probable beta-D-xylosidase 7 [Source:UniProtKB/Swiss-Prot;Acc:Q9SGZ5] | BXL7 |
| AT1G78090.1 | AT1G78090.1 | Trehalose-phosphate phosphatase B [Source:UniProtKB/Swiss-Prot;Acc:Q9C9S4] | TPPB |
| AT1G80440.1 | AT1G80440.1 | F-box/kelch-repeat protein At1g80440 [Source:UniProtKB/Swiss-Prot;Acc:Q9M8L2] | |
| AT2G15620.1 | AT2G15620.1 | Ferredoxin–nitrite reductase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q39161] | NIR1 |
| AT2G17820.1 | AT2G17820.1 | Histidine kinase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SXL4] | AHK1 |
| AT2G27510.1 | AT2G27510.1 | Ferredoxin [Source:UniProtKB/TrEMBL;Acc:A0A178VWS0] | FD3 |
| AT2G30760.1 | AT2G30760.1 | Uncharacterized protein At2g30760 [Source:UniProtKB/TrEMBL;Acc:O49341] | |
| AT2G35637.1 | AT2G35637.1 | other RNA [Source:TAIR;Acc:AT2G35637] | |
| AT2G36570.1 | AT2G36570.1 | Leucine-rich repeat receptor-like protein kinase PXC1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJQ1] | PXC1 |
| AT2G36580.1 | AT2G36580.1 | Pyruvate kinase [Source:UniProtKB/TrEMBL;Acc:A0A178VX82] | |
| AT2G39210.1 | AT2G39210.1 | At2g39210/T16B24.15 [Source:UniProtKB/TrEMBL;Acc:O80960] | |
| AT2G39880.1 | AT2G39880.1 | Transcription factor MYB25 [Source:UniProtKB/Swiss-Prot;Acc:O04192] | MYB25 |
| AT2G41440.1 | AT2G41440.1 | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT2G41470.1); Ha. [Source:TAIR;Acc:AT2G41440] | |
| AT3G07350.1 | AT3G07350.1 | F21O3.6 protein [Source:UniProtKB/TrEMBL;Acc:Q9SRT1] | |
| AT3G19430.1 | AT3G19430.1 | Late embryogenesis abundant protein-related / LEA protein-like protein [Source:UniProtKB/TrEMBL;Acc:A0A178VGJ3] | |
| AT3G47520.1 | AT3G47520.1 | Malate dehydrogenase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SN86] | MDH |
| AT3G47980.1 | AT3G47980.1 | At3g47980 [Source:UniProtKB/TrEMBL;Acc:Q84MC3] | |
| AT3G48360.1 | AT3G48360.1 | BTB/POZ and TAZ domain-containing protein 2 [Source:UniProtKB/Swiss-Prot;Acc:Q94BN0] | BT2 |
| AT3G49940.1 | AT3G49940.1 | LOB domain-containing protein 38 [Source:UniProtKB/Swiss-Prot;Acc:Q9SN23] | LBD38 |
| AT3G50750.1 | AT3G50750.1 | BEH1 [Source:UniProtKB/TrEMBL;Acc:A0A178VA20] | BEH1 |
| AT4G31910.1 | AT4G31910.1 | Brassinosteroid-related acyltransferase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SZ58] | BAT1 |
| AT4G36850.1 | AT4G36850.1 | PQ-loop repeat family protein / transmembrane family protein [Source:UniProtKB/TrEMBL;Acc:Q94AH7] | |
| AT4G38340.1 | AT4G38340.1 | Protein NLP3 [Source:UniProtKB/Swiss-Prot;Acc:Q9SVF1] | NLP3 |
| AT5G01215.1 | AT5G01215.1 | other RNA [Source:TAIR;Acc:AT5G01215] | |
| AT5G01215.2 | AT5G01215.2 | other RNA [Source:TAIR;Acc:AT5G01215] | |
| AT5G04840.1 | AT5G04840.1 | At5g04840 [Source:UniProtKB/TrEMBL;Acc:Q8L5X9] | |
| AT5G09800.1 | AT5G09800.1 | U-box domain-containing protein 28 [Source:UniProtKB/Swiss-Prot;Acc:Q9LXE3] | PUB28 |
| AT5G13110.1 | AT5G13110.1 | Glucose-6-phosphate 1-dehydrogenase 2, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9FY99] | G6PD2 |
| AT5G14760.1 | AT5G14760.1 | L-aspartate oxidase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q94AY1] | AO |
| AT5G51830.1 | AT5G51830.1 | Probable fructokinase-7 [Source:UniProtKB/Swiss-Prot;Acc:Q9FLH8] | |
| AT5G57770.1 | AT5G57770.1 | Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region [Source:TAIR;Acc:AT5G57770] | |
| AT5G63160.1 | AT5G63160.1 | BTB/POZ and TAZ domain-containing protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9FMK7] | BT1 |
| AT5G65980.1 | AT5G65980.1 | Protein PIN-LIKES 7 [Source:UniProtKB/Swiss-Prot;Acc:Q9FKY4] | PILS7 |
| AT5G67420.1 | AT5G67420.1 | LOB domain-containing protein 37 [Source:UniProtKB/Swiss-Prot;Acc:Q9FN11] | LBD37 |